Aligned sequence reads are stored in BAM format.
Information on the content and state of BAM file is stored in its header.
The body of the BAM file hold information on the original reads.
As well as on the positions reads map to in the genome.
In our previous session we saw how to align data to the genome using an aligner and that we can use a splice aware aligner to map RNAseq reads across splice junctions.
In this session we will work with aligned data as BAM files.
I have provided BAM files we saw in our last session as our test data from today.
This can be found in Data/liver.bodyMap.bam
We introduced the Rsamtools package in our last session to help us post process our newly aligned BAM files.
The Rsamtools package is the basis for many R/Bioconductor packages working with alignments in BAM format including another package we will work with today, the GenomicAligments package.
First we can load the Rsamtools package.
library(Rsamtools)We saw in our last session we can sort files using the sortBam() function. This function returns the name of sorted BAM file.
By default the file is sorted by chromosome name and then by coordinates of reads within these chromosomes.
coordSorted <- sortBam("../../Data/liver.bodyMap.bam",
"Sorted_liver")
coordSorted## [1] "Sorted_liver.bam"
Some external programs will require reads be sorted by read name, not coordinates. To sort by read name we can set the sortBam arguement byQname to TRUE.
readnameSorted <- sortBam("../../Data/liver.bodyMap.bam",
"SortedByName_liver",
byQname=TRUE)
readnameSorted## [1] "SortedByName_liver.bam"
We can control how much memory we use with the maxMemory parameter. This allows to sort very large files on smaller memory computers (such as our laptops).
The maxMemory is specified as the maximum MB of RAM which our sortBam() function call can use. In sorting Rsamtools will produce mulitple smaller BAM files, the smaller the maxMemory value the greater then number of temporary files.
Here in this example, we sort our file in 1MB of memory and it will produce 19 tempory files.
coordSorted <- sortBam("../../Data/liver.bodyMap.bam",
"Sorted_liver",
maxMemory=1)## Warning in .local(file, destination, ...): [bam_sort_core] merging from 19
## files...
coordSorted## [1] "Sorted_liver.bam"
Once we have a coordinate sorted file we can index these files to allow for use in other programs such as IGV.
Importantly for us, an indexed BAM file allows us to access information from a BAM file by genomic location.
indexBam("Sorted_liver.bam")## Sorted_liver.bam
## "Sorted_liver.bam.bai"
Despite the name, the quickBamFlagSummary can take a noticeable amount of time when working with large files.
To get a very quick overview of number of mapped reads we can use the indexed BAM file and the idxstatsBam() function.
myIndexStats <- idxstatsBam("Sorted_liver.bam")Importing and handling of BAM files is handled largely in the GenomicAlignments package.
We first load the package.
library(GenomicAlignments)We can retrieve the BAM header in R using the scanBamHeader() function and the name of the file we wish to access header information from.
Header information is returned as a list.
myHeader <- scanBamHeader("Sorted_liver.bam")
str(myHeader)## List of 1
## $ Sorted_liver.bam:List of 2
## ..$ targets: Named int [1:25] 249250621 135534747 135006516 133851895 115169878 107349540 102531392 90354753 81195210 78077248 ...
## .. ..- attr(*, "names")= chr [1:25] "chr1" "chr10" "chr11" "chr12" ...
## ..$ text :List of 27
## .. ..$ @HD: chr [1:2] "VN:1.0" "SO:coordinate"
## .. ..$ @SQ: chr [1:2] "SN:chr1" "LN:249250621"
## .. ..$ @SQ: chr [1:2] "SN:chr10" "LN:135534747"
## .. ..$ @SQ: chr [1:2] "SN:chr11" "LN:135006516"
## .. ..$ @SQ: chr [1:2] "SN:chr12" "LN:133851895"
## .. ..$ @SQ: chr [1:2] "SN:chr13" "LN:115169878"
## .. ..$ @SQ: chr [1:2] "SN:chr14" "LN:107349540"
## .. ..$ @SQ: chr [1:2] "SN:chr15" "LN:102531392"
## .. ..$ @SQ: chr [1:2] "SN:chr16" "LN:90354753"
## .. ..$ @SQ: chr [1:2] "SN:chr17" "LN:81195210"
## .. ..$ @SQ: chr [1:2] "SN:chr18" "LN:78077248"
## .. ..$ @SQ: chr [1:2] "SN:chr19" "LN:59128983"
## .. ..$ @SQ: chr [1:2] "SN:chr2" "LN:243199373"
## .. ..$ @SQ: chr [1:2] "SN:chr20" "LN:63025520"
## .. ..$ @SQ: chr [1:2] "SN:chr21" "LN:48129895"
## .. ..$ @SQ: chr [1:2] "SN:chr22" "LN:51304566"
## .. ..$ @SQ: chr [1:2] "SN:chr3" "LN:198022430"
## .. ..$ @SQ: chr [1:2] "SN:chr4" "LN:191154276"
## .. ..$ @SQ: chr [1:2] "SN:chr5" "LN:180915260"
## .. ..$ @SQ: chr [1:2] "SN:chr6" "LN:171115067"
## .. ..$ @SQ: chr [1:2] "SN:chr7" "LN:159138663"
## .. ..$ @SQ: chr [1:2] "SN:chr8" "LN:146364022"
## .. ..$ @SQ: chr [1:2] "SN:chr9" "LN:141213431"
## .. ..$ @SQ: chr [1:2] "SN:chrM" "LN:16571"
## .. ..$ @SQ: chr [1:2] "SN:chrX" "LN:155270560"
## .. ..$ @SQ: chr [1:2] "SN:chrY" "LN:59373566"
## .. ..$ @PG: chr [1:3] "ID:TopHat" "VN:1.1.4" "CL:/home/radon00/cole/bin/tophat -p 4 -a 5 -F 0.0 -r 100 --tmp-dir=/broad/shptmp/nmcabili/rnaseq/human/HBM_hg19"| __truncated__
We can access useful information such as chromosome names and lengths as with other lists using $ accessors. The first level of list is the file names. The second levels of list are
names(myHeader)## [1] "Sorted_liver.bam"
names(myHeader$Sorted_liver.bam)## [1] "targets" "text"
The chromosome lengths are stored in a named vector under the targets list.
myHeader$Sorted_liver.bam$targets[1:10]## chr1 chr10 chr11 chr12 chr13 chr14 chr15
## 249250621 135534747 135006516 133851895 115169878 107349540 102531392
## chr16 chr17 chr18
## 90354753 81195210 78077248
The text list contains information on sorting and programs.
myHeader$Sorted_liver.bam$text## $`@HD`
## [1] "VN:1.0" "SO:coordinate"
##
## $`@SQ`
## [1] "SN:chr1" "LN:249250621"
##
## $`@SQ`
## [1] "SN:chr10" "LN:135534747"
##
## $`@SQ`
## [1] "SN:chr11" "LN:135006516"
##
## $`@SQ`
## [1] "SN:chr12" "LN:133851895"
##
## $`@SQ`
## [1] "SN:chr13" "LN:115169878"
##
## $`@SQ`
## [1] "SN:chr14" "LN:107349540"
##
## $`@SQ`
## [1] "SN:chr15" "LN:102531392"
##
## $`@SQ`
## [1] "SN:chr16" "LN:90354753"
##
## $`@SQ`
## [1] "SN:chr17" "LN:81195210"
##
## $`@SQ`
## [1] "SN:chr18" "LN:78077248"
##
## $`@SQ`
## [1] "SN:chr19" "LN:59128983"
##
## $`@SQ`
## [1] "SN:chr2" "LN:243199373"
##
## $`@SQ`
## [1] "SN:chr20" "LN:63025520"
##
## $`@SQ`
## [1] "SN:chr21" "LN:48129895"
##
## $`@SQ`
## [1] "SN:chr22" "LN:51304566"
##
## $`@SQ`
## [1] "SN:chr3" "LN:198022430"
##
## $`@SQ`
## [1] "SN:chr4" "LN:191154276"
##
## $`@SQ`
## [1] "SN:chr5" "LN:180915260"
##
## $`@SQ`
## [1] "SN:chr6" "LN:171115067"
##
## $`@SQ`
## [1] "SN:chr7" "LN:159138663"
##
## $`@SQ`
## [1] "SN:chr8" "LN:146364022"
##
## $`@SQ`
## [1] "SN:chr9" "LN:141213431"
##
## $`@SQ`
## [1] "SN:chrM" "LN:16571"
##
## $`@SQ`
## [1] "SN:chrX" "LN:155270560"
##
## $`@SQ`
## [1] "SN:chrY" "LN:59373566"
##
## $`@PG`
## [1] "ID:TopHat"
## [2] "VN:1.1.4"
## [3] "CL:/home/radon00/cole/bin/tophat -p 4 -a 5 -F 0.0 -r 100 --tmp-dir=/broad/shptmp/nmcabili/rnaseq/human/HBM_hg19/Alignments/liver_50 -o /seq/rinnlab/Moran/HBM_hg19/Alignments/liver_50 --no-novel-juncs -j /seq/rinnlab/Moran/HBM_hg19/Junctions/HBM+GencodeV4+RinnNov10.juncs /seq/rinnscratch/cole/bowtie_indexes/hg19/hg19-male /broad/shptmp/nmcabili/rnaseq/human/HBM/50bp/FCB/s_8_1_sequence.txt /broad/shptmp/nmcabili/rnaseq/human/HBM/50bp/FCB/s_8_2_sequence.txt"
We can check the order for our name sorted BAM file too.
myHeader <- scanBamHeader("SortedByName_liver.bam")
myHeader$SortedByName_liver.bam$text["@HD"]## $`@HD`
## [1] "VN:1.0" "SO:queryname"
We can see the program information by reviewing the PG element. Note that PG elements are not always complete and depend on tools used. Here we can see aligner, version used and the command line command itself.
myHeader <- scanBamHeader("Sorted_liver.bam")
myHeader$Sorted_liver.bam$text["@PG"]## $`@PG`
## [1] "ID:TopHat"
## [2] "VN:1.1.4"
## [3] "CL:/home/radon00/cole/bin/tophat -p 4 -a 5 -F 0.0 -r 100 --tmp-dir=/broad/shptmp/nmcabili/rnaseq/human/HBM_hg19/Alignments/liver_50 -o /seq/rinnlab/Moran/HBM_hg19/Alignments/liver_50 --no-novel-juncs -j /seq/rinnlab/Moran/HBM_hg19/Junctions/HBM+GencodeV4+RinnNov10.juncs /seq/rinnscratch/cole/bowtie_indexes/hg19/hg19-male /broad/shptmp/nmcabili/rnaseq/human/HBM/50bp/FCB/s_8_1_sequence.txt /broad/shptmp/nmcabili/rnaseq/human/HBM/50bp/FCB/s_8_2_sequence.txt"
Now we have an idea how our BAM was constructed, we want to start to retrieve some of the data from our BAM file.
We can use the readGAlignments() function to import the BAM data into R.
The returned object is a GAlignments object.
myReads <- readGAlignments("Sorted_liver.bam")
class(myReads)## [1] "GAlignments"
## attr(,"package")
## [1] "GenomicAlignments"
The resulting GAlignments object contains much of the information we saw in the SAM file earlier on.
This include the chromomome, strand, start and end position of alignment.
myReads[1:2,]## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 - 75M 75 98000078 98000152 75
## [2] chr12 + 50M 50 98019636 98019685 50
## njunc
## <integer>
## [1] 0
## [2] 0
## -------
## seqinfo: 25 sequences from an unspecified genome
As for GRanges objects we have the width of our ranges (end coordinate - start coordinate) in the GAlignments objects.
Additional we have the qwidth which contains information on the width of the original read.
myReads[1:2,]## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 - 75M 75 98000078 98000152 75
## [2] chr12 + 50M 50 98019636 98019685 50
## njunc
## <integer>
## [1] 0
## [2] 0
## -------
## seqinfo: 25 sequences from an unspecified genome
The GAlignments object also contains information on the cigar strings within alignments and the number of junctions a read spans in the njunc column
Cigar strings denote the matches against reference.
75M - This is 75 matches in a row.
myReads[1:2,]## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 - 75M 75 98000078 98000152 75
## [2] chr12 + 50M 50 98019636 98019685 50
## njunc
## <integer>
## [1] 0
## [2] 0
## -------
## seqinfo: 25 sequences from an unspecified genome
The GAlignments object inherits much of the functionality we have seen with GRanges object.
We can access information using the same accessors as we saw with GRanges objects.
seqnames(myReads)## factor-Rle of length 61994 with 1 run
## Lengths: 61994
## Values : chr12
## Levels(25): chr1 chr10 chr11 chr12 chr13 ... chr8 chr9 chrM chrX chrY
start(myReads)[1:2]## [1] 98000078 98019636
The GAlignments object also has some new accessors to access the cigar and njunc information using the cigar and njunc functions.
cigar(myReads)[1:2]## [1] "75M" "50M"
njunc(myReads)[1:2]## [1] 0 0
We can also index and subset the same way as with GRanges objects.
Here we only keep reads on positive strand
myReads[strand(myReads) == "+"]## GAlignments object with 31364 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## [1] chr12 + 50M 50 98019636 98019685
## [2] chr12 + 75M 75 98048450 98048524
## [3] chr12 + 75M 75 98071725 98071799
## [4] chr12 + 75M 75 98071725 98071799
## [5] chr12 + 75M 75 98071725 98071799
## ... ... ... ... ... ... ...
## [31360] chr12 + 50M 50 99898308 99898357
## [31361] chr12 + 50M 50 99898308 99898357
## [31362] chr12 + 50M 50 99954800 99954849
## [31363] chr12 + 75M 75 99964155 99964229
## [31364] chr12 + 50M 50 99980103 99980152
## width njunc
## <integer> <integer>
## [1] 50 0
## [2] 75 0
## [3] 75 0
## [4] 75 0
## [5] 75 0
## ... ... ...
## [31360] 50 0
## [31361] 50 0
## [31362] 50 0
## [31363] 75 0
## [31364] 50 0
## -------
## seqinfo: 25 sequences from an unspecified genome
We can alter range positions using the GenomicRanges narrow() function.
Here we resize our reads to be the 5’ 1st base pair at the beginning of every read. Note that narrow() function does not take notice of strand.
The cigar strings and njunc will automatically be altered as well
my5primeReads <- narrow(myReads,start=1,width = 1)
my5primeReads## GAlignments object with 61994 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## [1] chr12 - 1M 1 98000078 98000078
## [2] chr12 + 1M 1 98019636 98019636
## [3] chr12 + 1M 1 98048450 98048450
## [4] chr12 + 1M 1 98071725 98071725
## [5] chr12 + 1M 1 98071725 98071725
## ... ... ... ... ... ... ...
## [61990] chr12 + 1M 1 99898308 99898308
## [61991] chr12 + 1M 1 99898308 99898308
## [61992] chr12 + 1M 1 99954800 99954800
## [61993] chr12 + 1M 1 99964155 99964155
## [61994] chr12 + 1M 1 99980103 99980103
## width njunc
## <integer> <integer>
## [1] 1 0
## [2] 1 0
## [3] 1 0
## [4] 1 0
## [5] 1 0
## ... ... ...
## [61990] 1 0
## [61991] 1 0
## [61992] 1 0
## [61993] 1 0
## [61994] 1 0
## -------
## seqinfo: 25 sequences from an unspecified genome
We can control this ourselves with a little subsetting.
myReadsPos <- narrow(myReads[strand(myReads) == "+"],
start=1,width = 1)
myReadsNeg <- narrow(myReads[strand(myReads) == "-"],
end=-1,width = 1)
my5primeReads <- c(myReadsPos,myReadsNeg)
my5primeReads## GAlignments object with 61994 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## [1] chr12 + 1M 1 98019636 98019636
## [2] chr12 + 1M 1 98048450 98048450
## [3] chr12 + 1M 1 98071725 98071725
## [4] chr12 + 1M 1 98071725 98071725
## [5] chr12 + 1M 1 98071725 98071725
## ... ... ... ... ... ... ...
## [61990] chr12 - 1M 1 99738530 99738530
## [61991] chr12 - 1M 1 99738530 99738530
## [61992] chr12 - 1M 1 99738530 99738530
## [61993] chr12 - 1M 1 99738530 99738530
## [61994] chr12 - 1M 1 99862397 99862397
## width njunc
## <integer> <integer>
## [1] 1 0
## [2] 1 0
## [3] 1 0
## [4] 1 0
## [5] 1 0
## ... ... ...
## [61990] 1 0
## [61991] 1 0
## [61992] 1 0
## [61993] 1 0
## [61994] 1 0
## -------
## seqinfo: 25 sequences from an unspecified genome
We can convert a GAlignments object to a GRanges to take advantage of other functions using the granges() function.
This is most useful when reads align continously to a genome (WGS, ChIP-seq, ATAC-seq)
myReadAsGRanges <- granges(myReads,use.mcols = TRUE)
myReadAsGRanges## GRanges object with 61994 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr12 [98000078, 98000152] -
## [2] chr12 [98019636, 98019685] +
## [3] chr12 [98048450, 98048524] +
## [4] chr12 [98071725, 98071799] +
## [5] chr12 [98071725, 98071799] +
## ... ... ... ...
## [61990] chr12 [99898308, 99898357] +
## [61991] chr12 [99898308, 99898357] +
## [61992] chr12 [99954800, 99954849] +
## [61993] chr12 [99964155, 99964229] +
## [61994] chr12 [99980103, 99980152] +
## -------
## seqinfo: 25 sequences from an unspecified genome
To convert RNA-seq reads which may span exons is a little more difficult. Since a read can potentially span multiple exons,a single read may need to be converted to multiple ranges.
To solve this we can use the grglist() function to return a GRangesList with a separate GRanges for each read.
myReadAsGRangesList <- grglist(myReads,use.mcols = TRUE)
myReadAsGRangesList[njunc(myReads) == 1]## GRangesList object of length 8261:
## [[1]]
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr12 [98880911, 98880966] +
## [2] chr12 [98884772, 98884790] +
##
## [[2]]
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## [1] chr12 [98880911, 98880966] +
## [2] chr12 [98884772, 98884790] +
##
## [[3]]
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## [1] chr12 [98880934, 98880966] +
## [2] chr12 [98884772, 98884788] +
##
## ...
## <8258 more elements>
## -------
## seqinfo: 25 sequences from an unspecified genome
We can convert GRanges back to a GAlignments object using the function as(myGranges,“GAlignments”)
myReadAsGRanges <- granges(myReads,use.mcols = TRUE)
myReadsAgain <- as(myReadAsGRanges,"GAlignments")
myReadsAgain## GAlignments object with 61994 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## [1] chr12 - 75M 75 98000078 98000152
## [2] chr12 + 50M 50 98019636 98019685
## [3] chr12 + 75M 75 98048450 98048524
## [4] chr12 + 75M 75 98071725 98071799
## [5] chr12 + 75M 75 98071725 98071799
## ... ... ... ... ... ... ...
## [61990] chr12 + 50M 50 99898308 99898357
## [61991] chr12 + 50M 50 99898308 99898357
## [61992] chr12 + 50M 50 99954800 99954849
## [61993] chr12 + 75M 75 99964155 99964229
## [61994] chr12 + 50M 50 99980103 99980152
## width njunc
## <integer> <integer>
## [1] 75 0
## [2] 50 0
## [3] 75 0
## [4] 75 0
## [5] 75 0
## ... ... ...
## [61990] 50 0
## [61991] 50 0
## [61992] 50 0
## [61993] 75 0
## [61994] 50 0
## -------
## seqinfo: 25 sequences from an unspecified genome
One very good reason to convert our GRanges objects back to a GAlignments object is so we can export our modified reads back to a BAM.
We can use the rtracklayer packages export() function to export our GAlignments file to a BAM file.
library(rtracklayer)
export(my5PrimeAsReads,con="myModifiedReads.bam")We can specify to import information from only specific regions by providing a GRanges of regions of interest to the which parameter in the ScanBamParam() function.
myRanges <- GRanges("chr12", IRanges(98987153,98988394))
myParam <- ScanBamParam(which=myRanges)
myParam## class: ScanBamParam
## bamFlag (NA unless specified):
## bamSimpleCigar: FALSE
## bamReverseComplement: FALSE
## bamTag:
## bamTagFilter:
## bamWhich: 1 ranges
## bamWhat:
## bamMapqFilter: NA
We can also control the information we import using the what parameter in ScanBamParam() function.
Here we import the read name, sequence and qualities.
myParam <- ScanBamParam(what=c("qname","seq","qual"))
infoInReads <- readGAlignments("Sorted_liver.bam",param = myParam)
infoInReads[1]## GAlignments object with 1 alignment and 3 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 - 75M 75 98000078 98000152 75
## njunc | qname
## <integer> | <character>
## [1] 0 | HWI-BRUNOP16X_0001:8:1:16461:187530#0
## seq qual
## <DNAStringSet> <PhredQuality>
## [1] TTTAATTGTG...TATAAATTNN [[eeeY[Y\\^...EEGGJIJJBB
## -------
## seqinfo: 25 sequences from an unspecified genome
We can access this additional information as we would in GRanges objects by using the mcols function.
mcols(infoInReads)## DataFrame with 61994 rows and 3 columns
## qname seq
## <character> <DNAStringSet>
## 1 HWI-BRUNOP16X_0001:8:1:16461:187530#0 TTTAATTGTG...TATAAATTNN
## 2 HWI-BRUNOP16X_0001:8:24:8092:17512#0 CTGGGACTAC...TTGTATCTTT
## 3 HWI-BRUNOP16X_0001:8:65:19457:112370#0 GCTGAGTTCA...TGGGGGGTTA
## 4 HWI-BRUNOP16X_0001:8:2:18583:93458#0 NNCAAGATGG...GCAATATCAT
## 5 HWI-BRUNOP16X_0001:8:23:3549:142929#0 NNCAAGATGG...GCAATATCAT
## ... ... ...
## 61990 HWI-BRUNOP16X_0001:8:26:10787:63093#0 CAGCTGTGTC...TGAAGGAAAT
## 61991 HWI-BRUNOP16X_0001:8:68:4416:97075#0 CAGCTGTGTC...TGAAGGAAAT
## 61992 HWI-BRUNOP16X_0001:8:21:20052:32589#0 ANTTTTAGTA...GGTATCAATC
## 61993 HWI-BRUNOP16X_0001:8:22:4306:35162#0 NNGGCTCACG...TCAAGACCAT
## 61994 HWI-BRUNOP16X_0001:8:41:20739:121715#0 CNTTACTCCA...GACCACCTCA
## qual
## <PhredQuality>
## 1 [[eeeY[Y\\^...EEGGJIJJBB
## 2 ___]_^]^^_...SSSPP`^c^_
## 3 bff`bddd`d...BBBBBBBBBB
## 4 BBHHHHEEHH...[eeeeYY[\\\\
## 5 BBHHHHEEHH...YWYWWYYY\\X
## ... ...
## 61990 fgfggggfgg...gheggdgggg
## 61991 fffffbcb[b...cfffffffcb
## 61992 TBTTFQSPNT...BBBBBBBBBB
## 61993 BBIIJIIGGG...BBBBBBBBBB
## 61994 TBTTTggggg...gggggggggg